In recent years, computation-based approaches to thermal management have attracted considerable attention, driven by the ability of deep learning to solve complex physics problems that are otherwise difficult to tackle with conventional techniques. Electronic systems require thermal management to prevent overheating and burnout and to improve their efficiency and lifespan. Numerical techniques have long been employed to assist in the thermal management of electronics; however, they come with several limitations. To improve the effectiveness of traditional numerical methods and address their shortcomings, researchers have applied artificial intelligence at various stages of the thermal management process. This study discusses in detail the current uses of deep learning in the field of electronics thermal management.
Today, the AI community is fixated on "state-of-the-art" scores (reported by 80% of NeurIPS papers) as the primary performance measure, while an important dimension, namely environmental impact, goes largely unreported. Computing power was the limiting factor a decade ago; in the foreseeable future, however, the challenge will be to develop environmentally friendly and efficient algorithms. The human brain has been optimized over millions of years and consumes roughly the same power as a typical laptop, so developing nature-inspired algorithms is one possible solution. In this study, we contrast the artificial neural networks in use today with what we find in nature and observe that spiking neural networks, modeled on the mammalian visual cortex, have attracted considerable interest despite their currently lower performance. We further highlight the hardware gap, the lack of neuromorphic, energy-efficient microchips, that keeps researchers from using spike-based computation at scale. Using neuromorphic processors instead of conventional GPUs could be considerably more environmentally friendly and efficient, and such processors make SNNs an ideal solution to this problem. This paper presents an in-depth discussion that highlights the current gaps and the lack of comparative studies, while proposing new research directions spanning two fields, neuroscience and deep learning. In addition, we define a new evaluation metric for reporting the carbon footprint of AI models.
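The metric itself is not spelled out in the abstract above, but as a minimal, hedged illustration of carbon-footprint reporting (the power draw, PUE, and grid-intensity values below are placeholder assumptions, not figures from the paper), a back-of-the-envelope estimate can be derived from energy use and grid carbon intensity:

```python
# Illustrative back-of-the-envelope carbon estimate for a training run.
# All numeric defaults are hypothetical placeholders, not values from the paper.

def training_co2_kg(gpu_power_kw: float, hours: float,
                    pue: float = 1.5, grid_kg_co2_per_kwh: float = 0.4) -> float:
    """Estimate kilograms of CO2 emitted by a training run.

    energy (kWh)   = device power (kW) * wall-clock hours * datacenter PUE
    emissions (kg) = energy (kWh) * grid carbon intensity (kg CO2 / kWh)
    """
    energy_kwh = gpu_power_kw * hours * pue
    return energy_kwh * grid_kg_co2_per_kwh

if __name__ == "__main__":
    # e.g. a single 300 W GPU running for 72 hours
    print(f"{training_co2_kg(0.3, 72):.1f} kg CO2")
```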
Offline reinforcement learning (RL) promises the ability to learn effective policies solely using existing, static datasets, without any costly online interaction. To do so, offline RL methods must handle distributional shift between the dataset and the learned policy. The most common approach is to learn conservative, or lower-bound, value functions, which underestimate the return of out-of-distribution (OOD) actions. However, such methods exhibit one notable drawback: policies optimized on such value functions can only behave according to a fixed, possibly suboptimal, degree of conservatism. This drawback can be alleviated if we instead learn policies for varying degrees of conservatism at training time and devise a method to dynamically choose one of them during evaluation. To do so, in this work, we propose learning value functions that additionally condition on the degree of conservatism, which we dub confidence-conditioned value functions. We derive a new form of a Bellman backup that simultaneously learns Q-values for any degree of confidence with high probability. By conditioning on confidence, our value functions enable adaptive strategies during online evaluation by controlling for confidence level using the history of observations thus far. This approach can be implemented in practice by conditioning the Q-function from existing conservative algorithms on the confidence. We theoretically show that our learned value functions produce conservative estimates of the true value at any desired confidence. Finally, we empirically show that our algorithm outperforms existing conservative offline RL algorithms on multiple discrete control domains.
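As a minimal sketch of how a Q-function can be conditioned on a confidence level and then queried at several confidence levels during evaluation (the architecture, layer sizes, and the name `delta` are illustrative assumptions, not the authors' implementation):

```python
# Minimal sketch of a confidence-conditioned Q-network for discrete actions.
import torch
import torch.nn as nn

class ConfidenceConditionedQ(nn.Module):
    def __init__(self, state_dim: int, num_actions: int, hidden: int = 256):
        super().__init__()
        # The confidence level delta in (0, 1) is appended to the state input.
        self.net = nn.Sequential(
            nn.Linear(state_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, state: torch.Tensor, delta: torch.Tensor) -> torch.Tensor:
        # state: (B, state_dim), delta: (B, 1) -> Q-values: (B, num_actions)
        return self.net(torch.cat([state, delta], dim=-1))

# At evaluation time the same network can be queried at several confidence
# levels, and the level can be chosen adaptively from the observation history.
q = ConfidenceConditionedQ(state_dim=8, num_actions=4)
s = torch.randn(1, 8)
for d in (0.1, 0.5, 0.9):
    print(d, q(s, torch.tensor([[d]])).argmax(dim=-1).item())
```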
The potential of offline reinforcement learning (RL) is that high-capacity models trained on large, heterogeneous datasets can lead to agents that generalize broadly, analogously to similar advances in vision and NLP. However, recent works argue that offline RL methods encounter unique challenges to scaling up model capacity. Drawing on the lessons from these works, we re-examine previous design choices and find that with appropriate choices: ResNets, cross-entropy based distributional backups, and feature normalization, offline Q-learning algorithms exhibit strong performance that scales with model capacity. Using multi-task Atari as a testbed for scaling and generalization, we train a single policy on 40 games with near-human performance using up to 80 million parameter networks, finding that model performance scales favorably with capacity. In contrast to prior work, we extrapolate beyond dataset performance even when trained entirely on a large (400M transitions) but highly suboptimal dataset (51% human-level performance). Compared to return-conditioned supervised approaches, offline Q-learning scales similarly with model capacity and has better performance, especially when the dataset is suboptimal. Finally, we show that offline Q-learning with a diverse dataset is sufficient to learn powerful representations that facilitate rapid transfer to novel games and fast online learning on new variations of a training game, improving over existing state-of-the-art representation learning approaches.
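A hedged sketch of two of the design choices named above, feature normalization of the penultimate layer and a categorical (cross-entropy) distributional head, is shown below; layer sizes are illustrative, and the target-distribution projection step of distributional backups is omitted for brevity:

```python
# Sketch of a Q-network with normalized features and a categorical head.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NormalizedCategoricalQ(nn.Module):
    def __init__(self, state_dim: int, num_actions: int, num_atoms: int = 51):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, 256), nn.ReLU())
        self.head = nn.Linear(256, num_actions * num_atoms)
        self.num_actions, self.num_atoms = num_actions, num_atoms

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        feat = self.trunk(state)
        feat = F.normalize(feat, dim=-1)              # unit-norm penultimate features
        logits = self.head(feat)
        return logits.view(-1, self.num_actions, self.num_atoms)

def distributional_loss(logits: torch.Tensor, actions: torch.Tensor,
                        target_probs: torch.Tensor) -> torch.Tensor:
    # Cross-entropy between the predicted return distribution for the taken
    # action (actions: int64, shape (B,)) and a precomputed projected target.
    chosen = logits[torch.arange(logits.shape[0]), actions]      # (B, num_atoms)
    return -(target_probs * F.log_softmax(chosen, dim=-1)).sum(-1).mean()
```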
Machine learning and deep learning-based decision making has become part of today's software. The goal of this work is to ensure that machine learning and deep learning-based systems are as trusted as traditional software. Traditional software is made dependable by following rigorous practices like static analysis, testing, debugging, verification, and repair throughout the development and maintenance life-cycle. Similarly, for machine learning systems, we need to keep these models up to date so that their performance is not compromised. For this, current systems rely on scheduled re-training of these models as new data arrives. In this work, we propose to measure the data drift that takes place when new data arrives, so that one can adaptively re-train the models whenever re-training is actually required, irrespective of schedules. In addition, we generate explanations at the sentence level and dataset level to capture why a given payload text has drifted.
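As a minimal, hedged sketch of schedule-free, drift-triggered re-training (the use of a two-sample Kolmogorov-Smirnov test on a scalar score and the threshold value are illustrative assumptions, not the method described above):

```python
# Trigger re-training from a measured drift score instead of a fixed schedule.
import numpy as np
from scipy.stats import ks_2samp

def drift_detected(reference: np.ndarray, incoming: np.ndarray,
                   p_threshold: float = 0.01) -> bool:
    """Return True when the incoming payload distribution has drifted."""
    statistic, p_value = ks_2samp(reference, incoming)
    return p_value < p_threshold

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ref = rng.normal(0.0, 1.0, size=2000)      # e.g. model scores on training data
    new = rng.normal(0.6, 1.0, size=500)       # scores on newly arriving payloads
    if drift_detected(ref, new):
        print("Drift detected: trigger re-training")
```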
A major challenge in cooperative perception is how to weight measurements from different sources to obtain an accurate result. Ideally, the weights should be inversely proportional to the error in the sensed information. However, previous cooperative sensor-fusion approaches for autonomous vehicles use a fixed error model, in which the covariance of a sensor and its recognition pipeline is simply the average of the measurement covariances over all sensing scenarios. The method proposed in this paper estimates error using key predictor terms that are highly correlated with sensing and localization accuracy, yielding an accurate covariance estimate for each sensor observation. We adopt a hierarchical fusion model consisting of local and global sensor-fusion steps. At the local fusion level, we add a covariance-generation stage that uses each sensor's error model and the measured distance to produce an expected covariance matrix for each observation. At the global fusion stage, we add an additional stage that generates a localization covariance matrix from the key predictor term, velocity, and combines it with the covariance produced by local fusion, enabling accurate cooperative sensing. To demonstrate our approach, we built a set of 1/10-scale model autonomous vehicles with accurate sensing capabilities and characterized their error properties against a motion-capture system. The results show average and maximum improvements in RMSE of 1.42x and 1.78x, respectively, when detecting vehicle positions in a four-vehicle cooperative-fusion scenario using our error model compared with a typical fixed error model.
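A hedged sketch of the two ingredients described above, a distance-dependent expected covariance per observation and inverse-covariance weighting when fusing observations, is given below; the linear error-model coefficients are made-up placeholders, not the calibrated values from the paper:

```python
# Distance-based covariance generation and inverse-covariance weighted fusion.
import numpy as np

def expected_covariance(distance_m: float, a: float = 0.05, b: float = 0.10) -> np.ndarray:
    """2x2 position covariance from an assumed linear distance error model."""
    sigma = a + b * distance_m            # std. dev. grows with sensing range
    return np.eye(2) * sigma ** 2

def fuse(observations: list[np.ndarray], covariances: list[np.ndarray]):
    """Weight each observation by its inverse covariance and fuse them."""
    info = sum(np.linalg.inv(c) for c in covariances)                      # information matrix
    info_vec = sum(np.linalg.inv(c) @ z for z, c in zip(observations, covariances))
    fused_cov = np.linalg.inv(info)
    return fused_cov @ info_vec, fused_cov

if __name__ == "__main__":
    z1, z2 = np.array([10.2, 4.9]), np.array([9.8, 5.3])   # two vehicles' estimates of one target
    c1, c2 = expected_covariance(5.0), expected_covariance(20.0)
    mean, cov = fuse([z1, z2], [c1, c2])
    print(mean, np.diag(cov))                               # nearer observer dominates
```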
Reinforcement learning (RL) algorithms hold the promise of autonomous skill acquisition for robotic systems. In practice, however, real-world robot RL typically requires time-consuming data collection and frequent human intervention to reset the environment. Moreover, robot policies learned with RL often fail when deployed in settings beyond those they were trained in. In this work, we study how these challenges can be addressed by effectively utilizing diverse offline datasets collected from previously seen tasks. When faced with a new task, our system adapts previously learned skills to quickly learn both to perform the new task and to return the environment to its initial state, effectively performing its own environment resets. Our empirical results show that incorporating prior data into robot reinforcement learning enables autonomous learning, substantially improves sample efficiency, and leads to better generalization.
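As a purely schematic, hedged sketch of alternating between a forward (task) policy and a reset policy so that the agent performs its own environment resets (the toy environment and random stand-in policies below are invented for illustration and are not the authors' system):

```python
# Schematic reset-free training loop: forward phase, then a reset phase that
# replaces a human-provided environment reset.
import random

class ToyEnv:
    """1-D toy environment: the task is to reach state +5; the start state is 0."""
    def __init__(self):
        self.state = 0
    def step(self, action: int):
        self.state += action                 # action in {-1, +1}
        return self.state
    def at_goal(self) -> bool:
        return self.state >= 5
    def at_start(self) -> bool:
        return self.state == 0

def random_policy(state: int) -> int:
    return random.choice([-1, 1])

env = ToyEnv()
forward, reset = random_policy, random_policy     # stand-ins for learned skills
for episode in range(10):
    # Forward phase: attempt the task until the goal (or a step budget) is hit.
    for _ in range(50):
        env.step(forward(env.state))
        if env.at_goal():
            break
    # Reset phase: the reset policy drives the environment back to its start state.
    for _ in range(50):
        if env.at_start():
            break
        env.step(reset(env.state))
```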
Offline reinforcement learning (RL) can learn control policies from static datasets but, like standard RL methods, it requires reward annotations for every transition. In many cases, labeling large datasets with rewards can be expensive, especially if those rewards must be provided by human labelers, while collecting diverse unlabeled data is relatively cheap. How, then, can we best leverage such unlabeled data in offline RL? One natural solution is to learn a reward function from the labeled data and use it to label the unlabeled data. In this paper, we find that, perhaps surprisingly, a much simpler approach that simply assigns zero rewards to unlabeled data leads to effective data sharing, both in theory and in practice, without learning any reward model at all. While this approach might seem strange (and incorrect) at first, we provide extensive theoretical and empirical analysis illustrating how it trades off reward bias, sample complexity, and distributional shift, often leading to good results. We characterize the conditions under which this simple strategy is effective, and further show that extending it with a simple reweighting approach can further alleviate the bias introduced by using incorrect reward labels. Our empirical evaluation confirms these findings in simulated robotic locomotion, navigation, and manipulation settings.
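A minimal sketch of the data-sharing recipe described above, merging unlabeled transitions into the training set with a reward of zero (the transition container and field names are illustrative):

```python
# Share unlabeled data by relabeling it with zero reward before training.
from dataclasses import dataclass
from typing import Any, List

@dataclass
class Transition:
    obs: Any
    action: Any
    reward: float
    next_obs: Any
    done: bool

def share_with_zero_reward(labeled: List[Transition],
                           unlabeled: List[Transition]) -> List[Transition]:
    """Combine datasets, assigning reward 0.0 to every unlabeled transition."""
    relabeled = [Transition(t.obs, t.action, 0.0, t.next_obs, t.done)
                 for t in unlabeled]
    return labeled + relabeled
```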
Despite overparameterization, deep networks trained via supervised learning are easy to optimize and exhibit excellent generalization. One hypothesis to explain this is that overparameterized deep networks enjoy the benefits of implicit regularization induced by stochastic gradient descent, which favors solutions that generalize well on test inputs. It is plausible that deep reinforcement learning (RL) methods could also benefit from this effect. In this paper, we discuss how the implicit regularization effect of SGD observed in supervised learning can in fact be harmful in the offline deep RL setting, leading to poor generalization and degenerate feature representations. Our theoretical analysis shows that when existing models of implicit regularization are applied to temporal-difference learning, the resulting derived regularizer favors degenerate solutions with excessive "aliasing", in stark contrast to the supervised-learning case. We back up these findings empirically, showing that the feature representations learned by a deep network value function trained via bootstrapping can indeed become degenerate, aliasing the representations of state-action pairs that appear on the two sides of the Bellman backup. To address this issue, we derive the form of this implicit regularizer and, inspired by this derivation, propose a simple and effective explicit regularizer, called DR3, that counteracts the undesirable effects of the implicit regularizer. When combined with existing offline RL methods, DR3 substantially improves performance and stability on Atari 2600 games, D4RL domains, and robotic manipulation from images.
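As a hedged sketch, the DR3-style explicit regularizer can be written as a penalty on the dot product between the learned features of the state-action pairs on the two sides of the Bellman backup; `phi` below stands for the value network's penultimate-layer features, and the coefficient is an illustrative hyperparameter rather than a value from the paper:

```python
# DR3-style explicit regularizer added to a TD / offline RL objective.
import torch

def dr3_penalty(phi_sa: torch.Tensor, phi_next_sa: torch.Tensor) -> torch.Tensor:
    """phi_sa, phi_next_sa: (batch, feature_dim) feature matrices for (s, a) and (s', a')."""
    return (phi_sa * phi_next_sa).sum(dim=-1).mean()

def total_loss(td_loss: torch.Tensor, phi_sa: torch.Tensor,
               phi_next_sa: torch.Tensor, dr3_coef: float = 0.03) -> torch.Tensor:
    # The penalty is simply added on top of the usual TD error.
    return td_loss + dr3_coef * dr3_penalty(phi_sa, phi_next_sa)
```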
Effectively leveraging large, previously collected datasets in reinforcement learning (RL) is a key challenge for large-scale real-world applications. Offline RL algorithms promise to learn effective policies from previously-collected, static datasets without further interaction. However, in practice, offline RL presents a major challenge, and standard off-policy RL methods can fail due to overestimation of values induced by the distributional shift between the dataset and the learned policy, especially when training on complex and multi-modal data distributions. In this paper, we propose conservative Q-learning (CQL), which aims to address these limitations by learning a conservative Q-function such that the expected value of a policy under this Q-function lower-bounds its true value. We theoretically show that CQL produces a lower bound on the value of the current policy and that it can be incorporated into a policy learning procedure with theoretical improvement guarantees. In practice, CQL augments the standard Bellman error objective with a simple Q-value regularizer which is straightforward to implement on top of existing deep Q-learning and actor-critic implementations. On both discrete and continuous control domains, we show that CQL substantially outperforms existing offline RL methods, often learning policies that attain 2-5 times higher final return, especially when learning from complex and multi-modal data distributions.
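A minimal sketch of the discrete-action form of the CQL regularizer added on top of the standard TD objective (tensor shapes and the trade-off coefficient `alpha` are illustrative):

```python
# Conservative Q-learning regularizer for discrete actions: push down a
# log-sum-exp over all action Q-values while pushing up the Q-values of
# actions that actually appear in the dataset.
import torch

def cql_loss(q_values: torch.Tensor, dataset_actions: torch.Tensor,
             td_loss: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """q_values: (batch, num_actions); dataset_actions: (batch,) int64."""
    logsumexp_q = torch.logsumexp(q_values, dim=-1)
    dataset_q = q_values.gather(1, dataset_actions.unsqueeze(-1)).squeeze(-1)
    conservative_term = (logsumexp_q - dataset_q).mean()
    return td_loss + alpha * conservative_term
```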